EDA for Country-Level Data

With the dataset of world countries and their happiness scores, I wanted to see which factors were most strongly associated with happiness. I first filtered out variables that had a large number of NAs.

Code
# Load libraries
library(tidyverse)
library(sf)
library(spdep)
library(viridis)
library(readxl)
library(rnaturalearth)
library(rnaturalearthdata)
library(corrplot)
Code
# Load spatial and happiness datasets
world_sf <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")
happiness_data <- read_excel("../data/WHR20_DataForTable2.1.xls")
Code
# Summary of numeric variables
summary(select_if(happiness_data, is.numeric))
      year       Life Ladder    Log GDP per capita Social support  
 Min.   :2005   Min.   :2.375   Min.   : 6.457     Min.   :0.2902  
 1st Qu.:2010   1st Qu.:4.623   1st Qu.: 8.309     1st Qu.:0.7483  
 Median :2013   Median :5.363   Median : 9.408     Median :0.8340  
 Mean   :2013   Mean   :5.446   Mean   : 9.245     Mean   :0.8111  
 3rd Qu.:2016   3rd Qu.:6.268   3rd Qu.:10.209     3rd Qu.:0.9046  
 Max.   :2019   Max.   :8.019   Max.   :11.728     Max.   :0.9873  
                                NA's   :29         NA's   :13      
 Healthy life expectancy at birth Freedom to make life choices
 Min.   :32.30                    Min.   :0.2575              
 1st Qu.:58.30                    1st Qu.:0.6431              
 Median :65.10                    Median :0.7575              
 Mean   :63.17                    Mean   :0.7385              
 3rd Qu.:68.39                    3rd Qu.:0.8524              
 Max.   :77.10                    Max.   :0.9852              
 NA's   :52                       NA's   :31                  
   Generosity       Perceptions of corruption Positive affect 
 Min.   :-0.33177   Min.   :0.0352            Min.   :0.3217  
 1st Qu.:-0.11718   1st Qu.:0.6927            1st Qu.:0.6233  
 Median :-0.02372   Median :0.8036            Median :0.7208  
 Mean   : 0.00011   Mean   :0.7491            Mean   :0.7096  
 3rd Qu.: 0.09115   3rd Qu.:0.8737            3rd Qu.:0.8011  
 Max.   : 0.67992   Max.   :0.9833            Max.   :0.9436  
 NA's   :83         NA's   :103               NA's   :21      
 Negative affect   Confidence in national government Democratic Quality
 Min.   :0.08343   Min.   :0.06877                   Min.   :-2.4483   
 1st Qu.:0.20595   1st Qu.:0.33641                   1st Qu.:-0.7885   
 Median :0.25577   Median :0.46479                   Median :-0.2259   
 Mean   :0.26721   Mean   :0.48316                   Mean   :-0.1350   
 3rd Qu.:0.31800   3rd Qu.:0.61626                   3rd Qu.: 0.6531   
 Max.   :0.70459   Max.   :0.99360                   Max.   : 1.5825   
 NA's   :15        NA's   :191                       NA's   :149       
 Delivery Quality   Standard deviation of ladder by country-year
 Min.   :-2.14497   Min.   :0.863                               
 1st Qu.:-0.71114   1st Qu.:1.747                               
 Median :-0.21663   Median :1.987                               
 Mean   :-0.00209   Mean   :2.052                               
 3rd Qu.: 0.69965   3rd Qu.:2.280                               
 Max.   : 2.18472   Max.   :4.073                               
 NA's   :148                                                    
 Standard deviation/Mean of ladder by country-year
 Min.   :0.1339                                   
 1st Qu.:0.3107                                   
 Median :0.3756                                   
 Mean   :0.3970                                   
 3rd Qu.:0.4632                                   
 Max.   :1.0228                                   
                                                  
 GINI index (World Bank estimate)
 Min.   :0.240                   
 1st Qu.:0.309                   
 Median :0.355                   
 Mean   :0.372                   
 3rd Qu.:0.430                   
 Max.   :0.634                   
 NA's   :1133                    
 GINI index (World Bank estimate), average 2000-2017, unbalanced panel
 Min.   :0.2495                                                       
 1st Qu.:0.3223                                                       
 Median :0.3688                                                       
 Mean   :0.3853                                                       
 3rd Qu.:0.4329                                                       
 Max.   :0.6240                                                       
 NA's   :180                                                          
 gini of household income reported in Gallup, by wp5-year
 Min.   :0.2010                                          
 1st Qu.:0.3696                                          
 Median :0.4282                                          
 Mean   :0.4490                                          
 3rd Qu.:0.5171                                          
 Max.   :0.9614                                          
 NA's   :370                                             
 Most people can be trusted, Gallup
 Min.   :0.0666                    
 1st Qu.:0.1398                    
 Median :0.1984                    
 Mean   :0.2263                    
 3rd Qu.:0.2816                    
 Max.   :0.6403                    
 NA's   :1668                      
 Most people can be trusted, WVS round 1981-1984
 Min.   :0.1765                                 
 1st Qu.:0.2903                                 
 Median :0.3802                                 
 Mean   :0.3902                                 
 3rd Qu.:0.4781                                 
 Max.   :0.5717                                 
 NA's   :1712                                   
 Most people can be trusted, WVS round 1989-1993
 Min.   :0.0660                                 
 1st Qu.:0.2236                                 
 Median :0.2924                                 
 Mean   :0.2838                                 
 3rd Qu.:0.3417                                 
 Max.   :0.5946                                 
 NA's   :1611                                   
 Most people can be trusted, WVS round 1994-1998
 Min.   :0.0487                                 
 1st Qu.:0.1769                                 
 Median :0.2299                                 
 Mean   :0.2498                                 
 3rd Qu.:0.2942                                 
 Max.   :0.6477                                 
 NA's   :1181                                   
 Most people can be trusted, WVS round 1999-2004
 Min.   :0.0759                                 
 1st Qu.:0.1558                                 
 Median :0.2320                                 
 Mean   :0.2680                                 
 3rd Qu.:0.3855                                 
 Max.   :0.6372                                 
 NA's   :1319                                   
 Most people can be trusted, WVS round 2005-2009
 Min.   :0.0382                                 
 1st Qu.:0.1435                                 
 Median :0.1984                                 
 Mean   :0.2643                                 
 3rd Qu.:0.3914                                 
 Max.   :0.7373                                 
 NA's   :1164                                   
 Most people can be trusted, WVS round 2010-2014
 Min.   :0.0315                                 
 1st Qu.:0.1187                                 
 Median :0.1935                                 
 Mean   :0.2372                                 
 3rd Qu.:0.3350                                 
 Max.   :0.6618                                 
 NA's   :1124                                   
Code
# Check for missing values
missing_data <- sapply(happiness_data, function(x) sum(is.na(x)))
print(missing_data)
                                                         Country name 
                                                                    0 
                                                                 year 
                                                                    0 
                                                          Life Ladder 
                                                                    0 
                                                   Log GDP per capita 
                                                                   29 
                                                       Social support 
                                                                   13 
                                     Healthy life expectancy at birth 
                                                                   52 
                                         Freedom to make life choices 
                                                                   31 
                                                           Generosity 
                                                                   83 
                                            Perceptions of corruption 
                                                                  103 
                                                      Positive affect 
                                                                   21 
                                                      Negative affect 
                                                                   15 
                                    Confidence in national government 
                                                                  191 
                                                   Democratic Quality 
                                                                  149 
                                                     Delivery Quality 
                                                                  148 
                         Standard deviation of ladder by country-year 
                                                                    0 
                    Standard deviation/Mean of ladder by country-year 
                                                                    0 
                                     GINI index (World Bank estimate) 
                                                                 1133 
GINI index (World Bank estimate), average 2000-2017, unbalanced panel 
                                                                  180 
             gini of household income reported in Gallup, by wp5-year 
                                                                  370 
                                   Most people can be trusted, Gallup 
                                                                 1668 
                      Most people can be trusted, WVS round 1981-1984 
                                                                 1712 
                      Most people can be trusted, WVS round 1989-1993 
                                                                 1611 
                      Most people can be trusted, WVS round 1994-1998 
                                                                 1181 
                      Most people can be trusted, WVS round 1999-2004 
                                                                 1319 
                      Most people can be trusted, WVS round 2005-2009 
                                                                 1164 
                      Most people can be trusted, WVS round 2010-2014 
                                                                 1124 
Code
# Visualize missing data
library(DataExplorer)
plot_missing(happiness_data)
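
The NA-based filtering can also be expressed as an explicit rule rather than a manual pick. The sketch below is only an illustration: it uses a small toy data frame with made-up values in place of `happiness_data`, and the 25% cutoff (`max_na_share`) is an assumed threshold, not the one used in this report.

```r
# Toy stand-in for happiness_data (hypothetical values, not from the report)
toy <- data.frame(
  `Life Ladder`                        = c(7.1, 5.4, 4.2, 6.3, 3.9),
  `Log GDP per capita`                 = c(10.8, 9.1, NA, 10.1, 7.6),
  `Most people can be trusted, Gallup` = c(NA, NA, 0.21, NA, NA),
  check.names = FALSE
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

# Keep only columns whose NA share is at or below the assumed cutoff
max_na_share <- 0.25
toy_filtered <- toy[, na_share <= max_na_share, drop = FALSE]

print(round(na_share, 2))
print(names(toy_filtered))
```

Working with the share of NAs instead of the raw count keeps the rule meaningful if rows are added or removed later.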

After filtering out those variables, I looked at the correlation matrix, which revealed a strong positive relationship between the happiness scores and “Log GDP per Capita,” “Healthy Life Expectancy,” and “Social Support.” Interestingly, “Delivery Quality” was also highly and positively correlated with the happiness scores. The negative correlations with “Perceptions of Corruption” and “gini of household income” were as expected: they point to the importance of effective governance and of lower income inequality.

Code
# Select relevant columns
happiness_data <- happiness_data %>%
  select(
    `year`, `Country name`, `Life Ladder`, `Log GDP per capita`, `Social support`, `Freedom to make life choices`, `Healthy life expectancy at birth`,
    `Generosity`, `Perceptions of corruption`, `Democratic Quality`, `Delivery Quality`, `GINI index (World Bank estimate), average 2000-2017, unbalanced panel`,
    `Confidence in national government`, `gini of household income reported in Gallup, by wp5-year`
  )

happiness_data <- happiness_data %>% 
  rename(`gini of income` = `gini of household income reported in Gallup, by wp5-year`)
Code
# Ensure the Life Ladder column is numeric
happiness_data <- happiness_data %>%
  mutate(`Life Ladder` = as.numeric(`Life Ladder`))

# Distribution of happiness scores
ggplot(happiness_data, aes(x = `Life Ladder`)) +
  geom_histogram(binwidth = 0.5, fill = "blue", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Happiness Scores", x = "Happiness Score", y = "Frequency")

Code
# Correlation matrix for numeric variables
cor_matrix <- cor(select_if(happiness_data, is.numeric), use = "complete.obs")

# Plot the correlation matrix with adjusted text size
corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.4)
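
To read the strongest relationships off the matrix without relying on the plot, the row for the happiness score can be pulled out and sorted. This is a minimal sketch on simulated toy data (the coefficients and noise levels are invented). It uses `use = "pairwise.complete.obs"`, which is a different choice from the `"complete.obs"` setting above: `"complete.obs"` drops every row that has an NA in any selected column, while the pairwise option uses all rows available for each pair of variables.

```r
# Simulated toy data (hypothetical; only the column names mirror the report)
set.seed(1)
n <- 200
gdp <- rnorm(n, mean = 9.2, sd = 1.2)
toy <- data.frame(
  `Life Ladder`               = 0.7 * gdp - 1 + rnorm(n, sd = 0.6),
  `Log GDP per capita`        = gdp,
  `Perceptions of corruption` = pmin(pmax(1.6 - 0.09 * gdp + rnorm(n, sd = 0.1), 0), 1),
  check.names = FALSE
)
toy$`Log GDP per capita`[sample(n, 10)] <- NA  # inject some missing values

# Correlation matrix using every available pair of observations
cor_matrix <- cor(toy, use = "pairwise.complete.obs")

# Correlations of each variable with the happiness score, strongest positive first
others <- colnames(cor_matrix) != "Life Ladder"
ladder_cor <- sort(cor_matrix["Life Ladder", others], decreasing = TRUE)
print(round(ladder_cor, 2))
```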

Code
key_variables <- c(
  "Life Ladder", "Log GDP per capita", "Delivery Quality",
  "Perceptions of corruption", "gini of income"
)

# Pairwise scatterplots for key variables
pairs(
  happiness_data[key_variables],
  main = "Pairwise Scatterplots of Key Variables",
  pch = 19,
  col = rgb(0, 0, 0, alpha = 0.5)
)

Code
# Boxplot of key variables to identify outliers (reuses key_variables defined above)

happiness_data %>%
  select(all_of(key_variables)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "skyblue") +
  theme_minimal() +
  labs(title = "Boxplots of Key Variables", y = "Value")

Then, I checked whether any outliers were present in these variables, since outliers could distort the correlations. There did not seem to be any prominent outliers that would distort the correlation matrix.
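
Because the boxplots share one axis while the variables sit on very different scales (Log GDP per capita runs from roughly 6.5 to 11.7, Perceptions of corruption from 0 to 1), a per-variable count can complement the visual check. The sketch below applies the usual 1.5 × IQR convention to a toy data frame with invented values; `count_outliers` is a hypothetical helper, not a function from any package.

```r
# Toy stand-in for the key variables (hypothetical values)
toy <- data.frame(
  `Life Ladder`        = c(4.6, 5.1, 5.4, 5.9, 6.3, 6.8, 7.2),
  `Log GDP per capita` = c(8.3, 8.9, 9.4, 9.8, 10.2, 10.6, 6.0),
  check.names = FALSE
)

# Count values lying more than 1.5 * IQR outside the quartiles
count_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

outlier_counts <- sapply(toy, count_outliers)
print(outlier_counts)
```

Each variable is judged against its own spread, so a variable with a small range is not hidden by one with a large range.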

Code
# Average happiness scores over time
happiness_data %>%
  group_by(year) %>%
  summarize(Average_Happiness = mean(`Life Ladder`, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = Average_Happiness)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "red", size = 2) +
  theme_minimal() +
  labs(title = "Average Happiness Scores Over Time", x = "Year", y = "Average Happiness Score")

Next, the temporal analysis showed a gradual upward trend in average happiness scores over 2005 to 2019, with one exception: a sudden decline from 2005 to 2006. I looked in detail at what caused this sudden change.

Code
# Filter data for the year 2005
happiness_2005 <- happiness_data %>%
  filter(year %in% c(2005))

# Calculate average happiness scores by country and year
average_scores <- happiness_2005 %>%
  group_by(`Country name`, year) %>%
  summarize(
    avg_happiness = mean(`Life Ladder`, na.rm = TRUE)
  ) %>%
  arrange(year, desc(avg_happiness)) # Sort by year and descending average happiness

# View the result
print(average_scores)
# A tibble: 27 × 3
# Groups:   Country name [27]
   `Country name`  year avg_happiness
   <chr>          <dbl>         <dbl>
 1 Denmark         2005          8.02
 2 Netherlands     2005          7.46
 3 Canada          2005          7.42
 4 Sweden          2005          7.38
 5 Australia       2005          7.34
 6 Belgium         2005          7.26
 7 Venezuela       2005          7.17
 8 Spain           2005          7.15
 9 France          2005          7.09
10 Saudi Arabia    2005          7.08
# ℹ 17 more rows
Code
# Filter data for the year 2006
happiness_2006 <- happiness_data %>%
  filter(year %in% c(2006))

# Calculate average happiness scores by country and year
average_scores <- happiness_2006 %>%
  group_by(`Country name`, year) %>%
  summarize(
    avg_happiness = mean(`Life Ladder`, na.rm = TRUE)
  ) %>%
  arrange(year, desc(avg_happiness)) # Sort by year and descending average happiness

# View the result
print(average_scores)
# A tibble: 89 × 3
# Groups:   Country name [89]
   `Country name`        year avg_happiness
   <chr>                <dbl>         <dbl>
 1 Finland               2006          7.67
 2 Switzerland           2006          7.47
 3 Norway                2006          7.42
 4 New Zealand           2006          7.31
 5 United States         2006          7.18
 6 Israel                2006          7.17
 7 Ireland               2006          7.14
 8 Austria               2006          7.12
 9 Costa Rica            2006          7.08
10 United Arab Emirates  2006          6.73
# ℹ 79 more rows
Code
# # Remove rows where the year is 2005
# filtered_data <- happiness_data %>%
#   filter(year != 2005)
Code
# Average happiness scores by country
average_happiness <- happiness_data %>%
  group_by(`Country name`) %>%
  summarize(Average_Happiness = mean(`Life Ladder`, na.rm = TRUE)) %>%
  arrange(desc(Average_Happiness))

print(head(average_happiness, 10)) # Top 10 happiest countries
# A tibble: 10 × 2
   `Country name` Average_Happiness
   <chr>                      <dbl>
 1 Denmark                     7.69
 2 Finland                     7.57
 3 Switzerland                 7.55
 4 Norway                      7.54
 5 Netherlands                 7.46
 6 Iceland                     7.43
 7 Canada                      7.40
 8 Sweden                      7.37
 9 New Zealand                 7.31
10 Australia                   7.29
Code
print(tail(average_happiness, 10)) # 10 least happy countries
# A tibble: 10 × 2
   `Country name`           Average_Happiness
   <chr>                                <dbl>
 1 Comoros                               3.94
 2 Zimbabwe                              3.93
 3 Yemen                                 3.91
 4 Tanzania                              3.69
 5 Rwanda                                3.65
 6 Afghanistan                           3.59
 7 Togo                                  3.56
 8 Burundi                               3.55
 9 Central African Republic              3.51
10 South Sudan                           3.40

After filtering for 2005 and 2006 specifically, I noticed that 2005 covers only 27 countries, compared with 89 in 2006, and that the countries surveyed in 2005 were largely those with high happiness scores. The drop from 2005 to 2006 therefore reflects a change in which countries were sampled rather than a real decline. Across all years, Denmark, Finland, and Switzerland consistently ranked near the top, while South Sudan and Afghanistan remained near the bottom, plausibly because of systemic challenges such as conflict and poverty.
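
The coverage gap can be checked in one table by counting countries per year next to the yearly mean, and by counting how many countries appear in both years. The following is a sketch on a toy panel with hypothetical countries and scores; it uses base R so that it runs without the tidyverse.

```r
# Toy stand-in for happiness_data (hypothetical countries and scores)
toy <- data.frame(
  `Country name` = c("A", "B", "A", "B", "C", "D"),
  year           = c(2005, 2005, 2006, 2006, 2006, 2006),
  `Life Ladder`  = c(7.5, 7.1, 7.4, 7.0, 4.8, 4.2),
  check.names = FALSE
)

# Number of distinct countries and mean score for each year
yearly <- data.frame(
  year          = sort(unique(toy$year)),
  n_countries   = as.vector(tapply(toy$`Country name`, toy$year,
                                   function(x) length(unique(x)))),
  avg_happiness = as.vector(tapply(toy$`Life Ladder`, toy$year,
                                   mean, na.rm = TRUE))
)
print(yearly)

# Countries observed in both years; a small overlap means the two yearly
# means describe largely different sets of countries
in_2005 <- toy$`Country name`[toy$year == 2005]
in_2006 <- toy$`Country name`[toy$year == 2006]
n_common <- length(intersect(in_2005, in_2006))
print(n_common)
```

In the toy panel the yearly mean falls only because two low-scoring countries enter in 2006; the scores of the two countries present in both years barely change.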

Code
# Pairwise scatterplots for key variables
pairs(
  select(happiness_data, `Life Ladder`, `Log GDP per capita`, `Delivery Quality`, `Healthy life expectancy at birth`),
  main = "Pairwise Scatterplots of Key Variables",
  pch = 19,
  col = rgb(0, 0, 0, alpha = 0.5)
)

Code
# Scatterplot of Happiness vs. Log GDP per Capita
ggplot(happiness_data, aes(x = `Log GDP per capita`, y = `Life Ladder`)) +
  geom_point(alpha = 0.7, color = "darkblue") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  theme_minimal() +
  labs(
    title = "Happiness vs. Log GDP per Capita",
    x = "Log GDP per Capita",
    y = "Happiness Score"
  )

Code
# Scatterplot of Happiness vs. Delivery Quality
ggplot(happiness_data, aes(x = `Delivery Quality`, y = `Life Ladder`)) +
  geom_point(alpha = 0.7, color = "green") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  theme_minimal() +
  labs(
    title = "Happiness vs. Delivery Quality",
    x = "Delivery Quality",
    y = "Happiness Score"
  )

Code
# Scatterplot of Happiness vs. Gini Income
ggplot(happiness_data, aes(x = `gini of income`, y = `Life Ladder`)) +
  geom_point(alpha = 0.7, color = "green") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  theme_minimal() +
  labs(
    title = "Happiness vs. Gini Income",
    x = "GINI Income",
    y = "Happiness"
  )

Scatterplots of happiness against key predictors confirm the patterns seen in the correlation matrix. The positive relationship between happiness and “Log GDP per Capita” points to the importance of economic prosperity, while the negative relationship with the Gallup household-income Gini (“gini of income”) highlights the adverse association with income inequality. Similarly, “Delivery Quality” shows a moderate to strong positive correlation with happiness, underlining the role of governance in well-being. With this EDA process, I could determine which variables to focus on for further spatial analyses.
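
The red lines in the scatterplots are least-squares fits, so their direction and strength can also be reported as numbers. The sketch below fits the same kind of one-predictor model with `lm()` on simulated toy data; the data-generating coefficients are invented for illustration and say nothing about the real dataset.

```r
# Simulated toy data (hypothetical; only the column names mirror the report)
set.seed(42)
n <- 150
gdp  <- rnorm(n, mean = 9.2, sd = 1.2)
gini <- pmin(pmax(rnorm(n, mean = 0.45, sd = 0.1), 0.2), 0.95)
toy <- data.frame(
  `Life Ladder`        = 1.7 + 0.7 * gdp - 6 * gini + rnorm(n, sd = 0.5),
  `Log GDP per capita` = gdp,
  `gini of income`     = gini,
  check.names = FALSE
)

# One-predictor linear fits, matching what geom_smooth(method = "lm") draws
fit_gdp  <- lm(`Life Ladder` ~ `Log GDP per capita`, data = toy)
fit_gini <- lm(`Life Ladder` ~ `gini of income`, data = toy)

# Slope (direction and size) and R-squared (share of variance explained)
slopes <- c(gdp = unname(coef(fit_gdp)[2]), gini = unname(coef(fit_gini)[2]))
r_squared <- c(gdp = summary(fit_gdp)$r.squared,
               gini = summary(fit_gini)$r.squared)

print(round(slopes, 2))
print(round(r_squared, 2))
```

Comparing R-squared across predictors gives a rough ranking of which variables to carry into the spatial analysis, though each fit here ignores the other predictors.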